R Data Wrangling: dplyr

Business Scenario

Suppose you are a data consultant recently hired by Company ABC, a mid-size company with 119 employees, to advise on global compensation for analytics positions throughout the company. Currently, the company has three offices: one in Columbus, Ohio, one in Toronto, Canada, and another in Munich, Germany and is shifting from a partially remote set-up to a fully remote one. Company ABC was able to obtain a listing of 607 salaries over 3 years on various technical positions from around the world to use as a benchmark and would like to understand what salaries look like in different locations.

Data Overview

Column Description
work_year The year the salary was paid.
experience_level The experience level in the job during the year with the following possible values: EN Entry-level / Junior MI Mid-level / Intermediate SE Senior-level / Expert EX Executive-level / Director
employment_type The type of employment for the role: PT Part-time FT Full-time CT Contract FL Freelance
job_title The role worked in during the year.
salary The total gross salary amount paid.
salary_currency The currency of the salary paid as an ISO 4217 currency code.
salary_in_usd The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).
employee_residence Employee’s primary country of residence in during the work year as an ISO 3166 country code.
remote_ratio The overall amount of work done remotely, possible values are as follows: 0 No remote work (less than 20%) 50 Partially remote 100 Fully remote (more than 80%)
company_location The country of the employer’s main office or contracting branch as an ISO 3166 country code.
company_size The average number of people that worked for the company during the year: S less than 50 employees (small) M 50 to 250 employees (medium) L more than 250 employees (large)

Read CSV Data into R

Let’s read in our sample data ds_salaries.csv into a data frame, sal, and, again, subset the data. This time we’ll save a data frame, salsub that is the first 30 rows of the data frame, sal.

Load the dplyr package

If dplyr is not installed, first install with the code install.packages('dplyr').

Select

Select the salary using the select() function in the dplyr package. Note that columns can be specified without the $ that’s needed in base R syntax.

Use piping (%>% or |>) to select the salary column. Piping will take the first argument (which is a data frame) and “pipe” it into the function.

TIP: Use the keys Ctrl + Shift + M to create the pipe, %>%

Exercise 1

Select the work_year, experience_level, salary_currency, and salary columns.

Solution:
1salsub %>%
2    select(work_year, experience_level, salary_currency, salary)
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Select the appropriate columns. Note that unlike in base R, salsub$ before the column name is not needed. Why? salsub was “piped” into the first argument of the select() function.

Alternative solution:

select(salsub, work_year, experience_level, salary_currency, salary)

This solution does not use a pipe, so the data frame salsub must be specified in the first argument.

Exercise 2

Select the work_year, experience_level, salary_currency, and salary columns and then show only the first 6 rows.

Solution:
1salsub %>%
2    select(work_year, experience_level, salary_currency, salary) %>%
3    head()
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Select the appropriate columns.
3
Return only the first 6 rows with the head() function. Why is it unnecessary to specify 6?

Alternative solution:

head(select(salsub, work_year, experience_level, salary_currency, salary))

This solution does not use a pipe, so the data frame salsub must be specified in the first argument. Notice that the most recent function calls surround the prior function calls.

Filter

Filter the data with the criteria, company_location is equal to US, using the filter() function in the dplyr package.

Without piping:

With piping:

Change the filter to include company locations of either US or AU using the %in% function.

Tip: To add multiple filter conditions separate with & for requiring both conditions to be met or | to require at least one of the conditions to be met.

Exercise 3

Filter the data to show rows where experience_level is equal to senior level AND salary_in_usd is greater than $100k.

Solution:
1salsub %>%
2    filter(experience_level == 'SE' & salary_in_usd > 100000)
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Filter on experience_level and salary_in_usd.

Exercise 4

Filter the data to show rows where experience_level is equal to senior level OR salary_in_usd is greater than $100k.

Solution:
1salsub %>%
2    filter(experience_level == 'SE' | salary_in_usd > 100000)
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Filter on experience_level or salary_in_usd.

Exercise 5

Filter the data to show the rows where salary_in_usd is blank (i.e. is NA).

Solution:
1salsub %>%
2    filter(is.na(salary_in_usd))
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Filter where salary_in_usd is blank using the is.na() function.

Arrange

Use arrange() to order the data by a variable in either ascending or descending order.

Ascending:

Descending:

Exercise 6

Filter the data where salary_currency is equal to USD, arrange salary in descending order, and select only the work_year, job_title, salary columns.

Solution:
1salsub %>%
2    filter(salary_currency == 'USD') %>%
3    arrange(desc(salary)) %>%
4    select(work_year, job_title, salary)
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Filter where salary_currence is equal to USD.
3
Arrange salary in descending order.
4
Select appropriate columns.

Mutate

mutate() adds another column to the data.

Make a copy of another column.

Create a new column based on an if/else statement.

Create a new column based on a case when statement.

Exercise 7

Create a new column, USD_or_not, that shows “Yes” if salary_currency is equal to USD and “No” for all other currencies. Select the salary, salary_currency, and USD_or_not columns.

Solution:
1salsub %>%
2    mutate(USD_or_not = ifelse(salary_currency == "USD", "Yes", "No")) %>%
3    select(salary, salary_currency, USD_or_not)
1
Start with the data frame, salsub, and pipe %>% to the next function.
2
Create a new column using an ifelse statement.
3
Select appropriate columns.

Summarize

Use summarize() (or summarise) to create aggregate variables. If you’d like summaries to be grouped by other variables, first specify these variables with the group_by statement.

Let’s find the average salary_in_usd across our full dataset, sal.

Let’s try again making sure we account for NAs.

Let’s look at the average salaries for each currency and the number of rows in each.

Exercise 8

Find the average salary using the sal dataset where salary_currency is equal to USD and grouping by work_year.

Solution:
1sal %>%
2    filter(salary_currency == 'USD') %>%
3    group_by(work_year) %>%
4    summarise(avg_usd_salary = mean(salary))
1
Start with the data frame, sal, and pipe %>% to the next function.
2
Filter on USD salaries.
3
Group by work_year.
4
Find the average salary.